To address the customer questions above, we will follow the Data Science approach outlined below. To proceed, we will first explore the various features and insights in the dataset.
We will use Azure Machine Learning service and create Jupyter notebooks to build the model. AML Notebooks provide a convenient way to create and manage notebooks and help to scale if more processing power is required.
We will first import the various dependencies that will be required.
import logging
from matplotlib import pyplot as plt
import pandas as pd
import os
import azureml.core
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
print("This notebook was created using version 1.17.0 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")
ws = Workspace.from_config()
output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T
We will upload the dataset to Azure Blob Storage and access it using the AML Dataset interface. We will also create training and validation datasets. The prediction label is EmployeeLeft, which will be used later.
data = "https://anildwablobstorage.blob.core.windows.net/public/EmployeeTurnoverDataset.csv"
dataset = Dataset.Tabular.from_delimited_files(data)
training_data, validation_data = dataset.random_split(percentage=0.8, seed=223)
label_column_name = 'EmployeeLeft'
training_data.to_pandas_dataframe()
We will use the Python integration built into Power BI to perform principal component analysis. We will create a scree plot using the PCA variance ratios and compare across the principal components in Power BI. We also convert the categorical features to numeric as part of feature engineering.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import preprocessing
import matplotlib.pyplot as plt
df = dataset.to_pandas_dataframe()
categorical_cols = ['Email Domain', 'Recruiting Location Code',
                    'Recruiting Method Code', 'LinkedIn Skill Code']
for col in categorical_cols:
    # Encode each categorical column as numeric codes, then drop the original
    df[col + ' Category'] = df[col].astype('category').cat.codes
    df.drop(col, axis=1, inplace=True)
#print(df.head())
#print(df.shape)
scaled_data = preprocessing.scale(df.T)
pca = PCA() # create a PCA object
pca.fit(scaled_data) # do the math
pca_data = pca.transform(scaled_data) # get PCA coordinates for scaled_data
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]
pca_variance_df = pd.DataFrame(per_var, index=labels,columns=['variance_ratio'])
pca_variance_df['PCA'] = pca_variance_df.index
pca_df = pd.DataFrame(pca_data, index=[df.columns], columns=labels)
pca_df['columns'] = pca_df.index
pca_df
The scree plot shows that principal component PC1, which maps to the Current LinkedIn Activity feature, accounts for 96% of the variance, far more than any other principal component.
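The scree plot built in Power BI can also be sketched directly with matplotlib. The following is a minimal, self-contained sketch on synthetic data (the real plot uses pca_variance_df computed above); the four correlated columns produce a dominant first component, mimicking the pattern described here.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic stand-in for the scaled employee data: four columns share
# one latent factor, two are pure noise, so PC1 dominates.
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
correlated = [z + 0.1 * rng.normal(size=(100, 1)) for _ in range(4)]
X = scale(np.hstack(correlated + [rng.normal(size=(100, 2))]))

pca = PCA().fit(X)
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = ['PC' + str(i) for i in range(1, len(per_var) + 1)]

# Bar chart of explained variance per principal component
plt.bar(labels, per_var)
plt.ylabel('Percentage of explained variance')
plt.xlabel('Principal component')
plt.title('Scree plot')
plt.savefig('scree_plot.png')
plt.close()
print(per_var)
```

On the real dataset the same bar chart is driven by pca_variance_df, with PC1 towering over the rest.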

Now if we visualize 'Current LinkedIn Activity' against 'EmployeeLeft' and slice by count, we get the following. This confirms that 'Current LinkedIn Activity' is related to 'EmployeeLeft' and could potentially point to a cause of the employee turnover.
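The count comparison done in Power BI can be approximated in pandas with a crosstab. This is a minimal sketch on toy data; the column names 'Current LinkedIn Activity' and 'EmployeeLeft' come from the dataset above, but the values here are synthetic.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe (synthetic assumption: higher
# activity makes leaving more likely)
rng = np.random.default_rng(42)
activity = rng.integers(0, 5, size=200)
left = (rng.random(200) < activity / 5).astype(int)
toy = pd.DataFrame({'Current LinkedIn Activity': activity,
                    'EmployeeLeft': left})

# Count of employees per (activity, left) combination, like the Power BI slice
counts = pd.crosstab(toy['Current LinkedIn Activity'], toy['EmployeeLeft'])
print(counts)
```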

Now if we plot a scatter diagram of 'Current LinkedIn Activity' against 'EmployeeLeft', we see the following.

The diagram shows a non-linear relationship between 'Current LinkedIn Activity' and 'EmployeeLeft', so we will use the Logistic Regression algorithm to build the prediction model.
We will build the logistic regression model with scikit-learn.
# Initialize dataset
X_raw = dataset.to_pandas_dataframe()
X_raw.drop('EmployeeLeft', axis=1, inplace=True)
Y = dataset.to_pandas_dataframe()['EmployeeLeft']
X_raw
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Scale the dataset
A = X_raw[['Race (code)','Years of Service']]
X_dummies = pd.get_dummies(X_raw)
sc = StandardScaler()
X_scaled = sc.fit_transform(X_dummies)
X_scaled = pd.DataFrame(X_scaled, columns=X_dummies.columns)
le = LabelEncoder()
Y = le.fit_transform(Y)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(X_scaled,
                                                                     Y,
                                                                     A,
                                                                     test_size=0.2,
                                                                     random_state=0,
                                                                     stratify=Y)
# Work around indexing issue
X_train = X_train.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)
from sklearn.linear_model import LogisticRegression
unmitigated_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)
unmitigated_predictor.fit(X_train, Y_train)
# call the predict function on the model
y_pred = unmitigated_predictor.predict(X_test)
y_pred
X_test
# register_model is a helper defined later in this notebook (model registration section)
lr_reg_id = register_model("fairness_employeeturover_logistic_regression", unmitigated_predictor)
from sklearn.metrics import confusion_matrix
import numpy as np
import itertools
cf = confusion_matrix(Y_test, y_pred)
plt.imshow(cf,cmap=plt.cm.Blues,interpolation='nearest')
plt.colorbar()
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
class_labels = ['False','True']
tick_marks = np.arange(len(class_labels))
plt.xticks(tick_marks,class_labels)
plt.yticks([-0.5,0,1,1.5],['','False','True',''])
# plotting text value inside cells
thresh = cf.max() / 2.
for i, j in itertools.product(range(cf.shape[0]), range(cf.shape[1])):
    plt.text(j, i, format(cf[i, j], 'd'),
             horizontalalignment='center',
             color='white' if cf[i, j] > thresh else 'black')
plt.show()
We will use Fairlearn to understand whether there is any bias in the model. We will use the protected features 'Race (code)' and 'Years of Service' for now to understand bias. Bias can be introduced into the model by various factors in the underlying data. For example, a particular race code could be under-represented, which could adversely affect results on real-world datasets by incorrectly predicting that an employee with that race code will leave or not leave.
Fairlearn comes with various algorithms that can detect bias and provide mitigation using certain parity constraints. We will use GridSearch on Logistic Regression with DemographicParity. DemographicParity works well for binary classification with features like the race code.
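Before applying GridSearch, it helps to see what DemographicParity actually measures: the selection rate (the fraction of employees predicted to leave) should be similar across sensitive groups. Below is a minimal hand computation on toy predictions (all values hypothetical); the final number is what Fairlearn reports as the demographic parity difference.

```python
import numpy as np
import pandas as pd

# Toy predictions and a toy sensitive feature (hypothetical values)
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
race_code = np.array([1, 1, 1, 1, 2, 2, 2, 2])

# Selection rate per group: fraction predicted as "left" (positive class)
rates = pd.Series(y_pred).groupby(race_code).mean()
print(rates)  # group 1: 0.75, group 2: 0.25

# Demographic parity difference: largest gap in selection rates across groups
dp_difference = rates.max() - rates.min()
print(dp_difference)  # -> 0.5
```

GridSearch with the DemographicParity constraint, used below, searches for models that shrink this gap while keeping the error rate low.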
We will create an experiment and upload the fairness metrics for the unmitigated model.
sf = { 'Race': A_test['Race (code)'], 'YearsofService': A_test['Years of Service']}
ys_pred = { lr_reg_id:unmitigated_predictor.predict(X_test) }
from fairlearn.metrics._group_metric_set import _create_group_metric_set
dash_dict = _create_group_metric_set(y_true=Y_test,
                                     predictions=ys_pred,
                                     sensitive_features=sf,
                                     prediction_type='binary_classification')
from azureml.core import Experiment
exp = Experiment(ws, "EmployeeTurnover_Fairness_Unmitigated_Model")
print(exp)
run = exp.start_logging()
# Upload the dashboard to Azure Machine Learning
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id
try:
    dashboard_title = "Fairness insights of Unmitigated Logistic Regression Classifier for EmployeeTurnover"
    # Set validate_model_ids parameter of upload_dashboard_dictionary to False if you have not registered your model(s)
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))
    # To test the dashboard, you can download it back and ensure it contains the right information
    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
finally:
    run.complete()
We can see that there is a disparity of 28% in the Race feature, which Fairlearn has detected. We can also see that the race with code 6 has the maximum disparity, 89.5%. We will now mitigate the bias using GridSearch.

from fairlearn.reductions import GridSearch, DemographicParity, ErrorRate
from sklearn.preprocessing import LabelEncoder, StandardScaler
sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),
                   constraints=DemographicParity(),
                   grid_size=71)
sweep.fit(X_train, Y_train,
          sensitive_features=A_train['Race (code)'])
predictors = sweep._predictors
errors, disparities = [], []
for m in predictors:
    classifier = lambda X: m.predict(X)
    error = ErrorRate()
    error.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train['Race (code)'])
    disparity = DemographicParity()
    disparity.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train['Race (code)'])
    errors.append(error.gamma(classifier)[0])
    disparities.append(disparity.gamma(classifier).max())
all_results = pd.DataFrame( {"predictor": predictors, "error": errors, "disparity": disparities})
dominant_models_dict = dict()
base_name_format = "employeeturnover_gs_model_{0}"
row_id = 0
for row in all_results.itertuples():
    model_name = base_name_format.format(row_id)
    errors_for_lower_or_eq_disparity = all_results["error"][all_results["disparity"] <= row.disparity]
    if row.error <= errors_for_lower_or_eq_disparity.min():
        dominant_models_dict[model_name] = row.predictor
    row_id = row_id + 1
predictions_dominant = {"employeeturnover_unmitigated": unmitigated_predictor.predict(X_test)}
models_dominant = {"employeeturnover_unmitigated": unmitigated_predictor}
for name, predictor in dominant_models_dict.items():
    value = predictor.predict(X_test)
    predictions_dominant[name] = value
    models_dominant[name] = predictor
from azureml.core import Workspace, Experiment, Model
import joblib
import os
os.makedirs('models', exist_ok=True)
def register_model(name, model):
    print("Registering ", name)
    model_path = "models/{0}.pkl".format(name)
    joblib.dump(value=model, filename=model_path)
    registered_model = Model.register(model_path=model_path,
                                      model_name=name,
                                      workspace=ws)
    print("Registered ", registered_model.id)
    return registered_model.id
model_name_id_mapping = dict()
for name, model in models_dominant.items():
    m_id = register_model(name, model)
    model_name_id_mapping[name] = m_id
predictions_dominant_ids = dict()
for name, y_pred in predictions_dominant.items():
    predictions_dominant_ids[model_name_id_mapping[name]] = y_pred
First, configure the Fairlearn metrics.
sf = { 'Race': A_test['Race (code)'], 'YearsofService': A_test['Years of Service']}
from fairlearn.metrics._group_metric_set import _create_group_metric_set
dash_dict = _create_group_metric_set(y_true=Y_test,
                                     predictions=predictions_dominant_ids,
                                     sensitive_features=sf,
                                     prediction_type='binary_classification')
Upload metrics to AML by creating a new experiment
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id
exp = Experiment(ws, "Fairlearn_GridSearch_EmployeeTurnover_1")
print(exp)
run = exp.start_logging()
try:
    dashboard_title = "Dominant Models from GridSearch"
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))
    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
finally:
    run.complete()

We will use the TabularExplainer to visualize local and global feature importance.
We will use the unmitigated model to understand and visualize explanations.
from interpret.ext.blackbox import TabularExplainer
from azureml.interpret import ExplanationClient
from interpret_community.widget import ExplanationDashboard
# Explain predictions on your local machine
tabular_explainer = TabularExplainer(unmitigated_predictor, X_train, features=X_train.columns.to_numpy())
#client = ExplanationClient.from_run(run)
# Explain overall model predictions (global explanation)
# Passing in test dataset for evaluation examples - note it must be a representative sample of the original data
# x_train can be passed as well, but with more examples explanations it will
# take longer although they may be more accurate
global_explanation = tabular_explainer.explain_global(X_test)
# Uploading model explanation data for storage or visualization in webUX
# The explanation can then be downloaded on any compute
comment = 'Global explanation of the logistic regression classifier trained on the EmployeeTurnover dataset'
#client.upload_model_explanation(global_explanation, comment=comment, model_id=original_model.id)
ExplanationDashboard(global_explanation, unmitigated_predictor, datasetX=X_test)
Using the data explorer built into the Explanation dashboard, we can visualize feature importance both for the overall model and for individual data points.

As we can see in the visual, Current LinkedIn Activity impacts the 'Predicted Y' axis in both classes: Class 0 (employee did not leave) and Class 1 (employee left).
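Outside the dashboard, a rough proxy for the global importance of a linear model such as logistic regression is the magnitude of its coefficients on standardized inputs. This is a minimal sketch on synthetic data (not the actual dataset), where only the first feature drives the label, standing in for Current LinkedIn Activity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only feature 0 determines the label (stand-in for
# 'Current LinkedIn Activity'); features 1-2 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(solver='liblinear').fit(X_scaled, y)

# |coefficient| on standardized inputs ~ rough global feature importance
importance = np.abs(clf.coef_[0])
print(importance.argmax())  # the driving feature dominates
```

The TabularExplainer's global explanation generalizes this idea to models that are not linear.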


Let us now see whether we were able to address the customer scenario.
We created a prediction model using logistic regression that can predict which employees are likely to leave. Using explanations, we were able to understand that Current LinkedIn Activity is the most likely driver of employees leaving. We also looked at potential bias in the model using Fairlearn bias-mitigation techniques.
Customer business leaders would like to understand causality where possible, especially which variables they should be looking at and whether there are any additional variables worth acquiring for future attempts.
Now that we understand that Current LinkedIn Activity is the leading factor, we can potentially investigate this feature further and chart out the next best actions.